Main analysis
We present this exploratory data analysis in the following subsections, as outlined in the introduction: artists, audio features, and words.
Artists
In this subsection, we examine the data at the artist level to answer some basic initial questions.
top_artists = df %>%
group_by(artist_base) %>%
summarize(num_singles = n()) %>%
arrange(desc(num_singles))
top_artists_30 = top_artists[1:30,]
ggplot(top_artists_30, aes(x = reorder(artist_base, num_singles), y = num_singles)) +
geom_col() +
coord_flip() +
xlab('Artist') +
ylab('Number of yearly top 100 singles from 1965-2015') +
labs(title = 'Who are the most popular artists of the past 50 years?')
most_explicit_artists = df %>%
group_by(artist_base) %>%
summarize(explicitness = sum(explicit)) %>%
arrange(desc(explicitness))
most_explicit_artists = most_explicit_artists[1:30,]
ggplot(most_explicit_artists, aes(x = reorder(artist_base, explicitness), y = explicitness)) +
geom_col() +
coord_flip() +
xlab('Artist') +
ylab('Number of yearly top 100 explicit singles from 1965-2015') +
labs(title = 'Who are the most popular explicit artists of the past 50 years?')
To measure the explicitness of each artist, we initially considered averaging the binary 0/1 explicit attribute of each song by artist. However, because many artists have only 1 single in the Yearly Top 100 throughout their career, this results in many artists having an average explicitness of 1. Thus, we fall back to simply counting the number of explicit singles per artist. Interestingly, only the top 3 most explicit artists (Eminem, Ludacris, and Drake) are also among the top 30 most popular artists. Eminem and Ludacris are famously prolific and explicit; in fact, we can see that all 15 and 14, respectively, of their top singles are explicit. More generally, we also notice that the vast majority of these singles are from the hip hop and rap genres.
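The difference between the two explicitness measures discussed above can be illustrated with a minimal sketch. The data frame `songs` and its values are hypothetical toy data, not taken from our dataset, and base R is used so that the snippet runs without the packages loaded elsewhere in this report.

```r
# Toy data (hypothetical): one row per charting single.
songs <- data.frame(
  artist_base = c("A", "A", "A", "A", "B", "C", "C"),
  explicit    = c(1, 1, 0, 1, 1, 0, 0)
)

# Mean explicitness: artist B has a single explicit song and scores a perfect 1.
mean_explicit <- tapply(songs$explicit, songs$artist_base, mean)

# Count of explicit singles: artist A, with three explicit singles, ranks first.
count_explicit <- tapply(songs$explicit, songs$artist_base, sum)

print(sort(mean_explicit, decreasing = TRUE))
print(sort(count_explicit, decreasing = TRUE))
```

Under the mean, an artist with one explicit single outranks an artist with three explicit singles out of four, which is the behavior that led us to prefer the count.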
most_featuring_artists = df %>%
mutate(is_collab = str_detect(artist, 'feat')) %>%
group_by(artist_base) %>%
summarize(num_collaborations = sum(is_collab)) %>%
arrange(desc(num_collaborations))
most_featuring_artists = most_featuring_artists[1:20,]
p1 = ggplot(most_featuring_artists, aes(x = reorder(artist_base, num_collaborations),
y = num_collaborations)) +
geom_col() +
xlab('Main artist') +
ylab('Number of top 100 singles featuring a guest artist') +
scale_y_continuous(breaks=1:9, labels=1:9) +
coord_flip()
# Capture the guest artist that follows "featuring" in the artist attribution
matches = str_match(df$artist, 'featuring\\s(.*)')[, 2]
matches = matches[!is.na(matches)]
matches = tibble(value = matches)
featured_artists = matches %>%
group_by(value) %>%
summarize(num_features = n()) %>%
arrange(desc(num_features))
featured_artists = featured_artists[1:20,]
p2 = ggplot(featured_artists, aes(x = reorder(value, num_features),
y = num_features)) +
geom_col() +
xlab('Featured artist') +
ylab('Number of top 100 singles featured as guest') +
scale_y_continuous(breaks=1:12, labels=1:12) +
coord_flip()
grid.arrange(p1, p2, ncol=2)
Many top singles are the product of a collaboration between two artists, with the resulting artist attribution of the song taking the form “X featuring Y”, where X denotes the main artist and Y denotes the guest/featured artist. Above, we see that Rihanna and Chris Brown most frequently feature other artists in their top singles, while Lil Wayne and T-Pain are the most frequent guests on other artists’ top singles. Particularly interesting is the fact that while Usher, Timbaland, Santana, and David Guetta are among the 20 most “featuring” artists, they are not even among the 20 most “featured” artists. Conversely, T-Pain, Snoop Dogg, Nicki Minaj, and Will.I.Am are among the 20 most “featured” artists despite not being among the 20 most “featuring” artists.
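As a minimal sketch of how an “X featuring Y” attribution can be split into its two parts, the snippet below uses base R regular expressions. The artist names are placeholders and the exact pattern is an assumption for illustration; the attribution strings in the real data may use other spellings (such as “feat.”) that this pattern would not catch.

```r
# Placeholder attribution strings (hypothetical, not from the dataset).
artist <- c("Artist A featuring Artist B", "Artist C", "Artist D featuring Artist A")

# Lazy first group so that the main artist stops at the first "featuring".
pattern <- "^(.*?)\\s+featuring\\s+(.*)$"
is_collab <- grepl(pattern, artist, perl = TRUE)

# Main artist: text before "featuring", or the whole string for solo singles.
main_artist <- ifelse(is_collab, sub(pattern, "\\1", artist, perl = TRUE), artist)

# Featured artist: text after "featuring", or NA for solo singles.
featured_artist <- ifelse(is_collab, sub(pattern, "\\2", artist, perl = TRUE), NA_character_)

print(data.frame(artist, main_artist, featured_artist))
```

Counting occurrences in `main_artist` among collaborations gives the “featuring” ranking, while counting the non-missing values of `featured_artist` gives the “featured” ranking.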
collaborations = df %>%
mutate(is_collab = str_detect(artist, 'feat')) %>%
group_by(year) %>%
summarize(num_collaborations = sum(is_collab))
ggplot(collaborations, aes(x = year, y = num_collaborations)) +
geom_line() +
geom_point() +
xlab('Year') +
ylab('Number of top 100 singles featuring a guest artist')
More interesting still is the rising trend in artist collaborations over time that begins during the 1990s and takes off dramatically before plateauing in the late 2000s. Additional comments about this trend are presented in the executive summary.
bin_width <- 1
cover <- read.csv("../data/Louis/cover_year.csv")
ggplot(cover, aes(year)) +
geom_histogram(color = "black", fill = "lightblue", binwidth = bin_width) +
theme(legend.position="bottom") +
ggtitle("Cover") +
theme(plot.title = element_text(hjust = 0.5)) +
xlab("Year") +
ylab("Count")
The histogram above shows the number of covers each year. We define a cover as a song having the same title as a previous song in the dataset. Having removed duplicates (each artist and song pair is now unique) is particularly helpful here. We can see that the number of covers increases over time, suggesting that nostalgia plays an important role in the music industry. Reducing the bin width makes that trend even clearer; we invite you to try it in the Shiny app. All preprocessing can be found at: here.
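The cover definition above (a song sharing its title with an earlier song in the dataset) can be sketched in a few lines of base R. This is only an illustration on hypothetical toy data, not the preprocessing we actually ran; in particular, the case-insensitive comparison is an assumption.

```r
# Toy data (hypothetical): titles, artists, and chart years.
songs <- data.frame(
  title  = c("Song One", "Song Two", "Song One", "Song Three", "Song Two"),
  artist = c("A", "B", "C", "D", "E"),
  year   = c(1970, 1975, 1988, 1990, 2005),
  stringsAsFactors = FALSE
)

# Sort chronologically so that "previous song" is well defined.
songs <- songs[order(songs$year), ]

# duplicated() flags every occurrence of a title after its first appearance.
songs$is_cover <- duplicated(tolower(songs$title))

# Number of covers per year, which is what the histogram counts.
covers_per_year <- table(songs$year[songs$is_cover])
print(covers_per_year)
```

Note that matching on title alone cannot distinguish a true cover from an unrelated song that happens to share a title, so counts obtained this way are an upper bound on actual covers.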
nunique_artists_year = df %>%
group_by(year) %>%
summarize(nunique = n_distinct(artist_base))
options(repr.plot.width = 16, repr.plot.height = 6)
ggplot(nunique_artists_year, aes(x = year, y = nunique)) +
geom_line() +
geom_point() +
ylab('Number of unique artists')
The plot above shows artist diversity over time, with diversity defined not on the basis of race but on the number of unique artists with top 100 singles each year. The trend depicted shows a mixed picture: the most diverse year falls in the early 1970s while the least diverse years are 2009 and 2010, yet the pattern remains noisy enough that more years are needed before making a more confident determination.
top_artists = df %>%
group_by(artist_base) %>%
summarize(num_singles = n(),
earliest_hit = min(year),
latest_hit = max(year),
longevity = latest_hit - earliest_hit,
hits_per_year = num_singles / longevity) %>%
arrange(desc(longevity))
top_artists_30 = top_artists[1:30,]
ggplot(top_artists_30) +
geom_segment(aes(x=earliest_hit, xend=latest_hit, y=reorder(artist_base, longevity), yend=reorder(artist_base, longevity), color=num_singles), size=5) +
geom_text(aes(x=latest_hit + 3, y=reorder(artist_base, longevity), label=paste(longevity, 'years'))) +
ggtitle('Career spans of most timeless artists')
Besides just revealing which artists had the most top 100 hit singles, one of the more interesting aspects of the data is that it allows us to see which artists had the greatest longevity of career, defined simply as the number of years between an artist’s earliest charting hit and most recent charting hit. A caveat here is that because the latest of any duplicate singles was removed from the dataset, some artists’ career lifespans appear shorter by a year than if duplicate singles had not been removed.
Above, we not only observe which artists had the greatest career longevity but also the time period during which an artist was popular, with lighter colors signifying an artist with greater numbers of total hit singles. Curiously, while Madonna and Mariah Carey have charted most frequently in the top 100, several artists (Santana, Cher, The Isley Brothers, Aretha Franklin, and others) have had longer charting careers.
This raises a natural question: does there exist a relationship between the number of singles an artist produces per year and their career span? Below, we observe that no artist has produced an average of greater than 2 hit singles a year and achieved a career span exceeding a decade. In fact, all artists generating on average more than 2.5 hit singles per year have career spans of less than 5 years, with the most significant “flash in the pan” artists being those that averaged 4 or more hit singles per year.
ggplot(top_artists, aes(x = longevity, y = hits_per_year, color = num_singles)) + geom_point(alpha=0.25)
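One detail of the hits-per-year ratio is worth making explicit: for an artist whose earliest and latest hits fall in the same year, longevity is 0, so dividing by it yields `Inf` in R. The sketch below reproduces this on hypothetical toy values and shows one possible alternative denominator; `hits_per_active_year` is an illustrative name and an assumption, not the definition used in the plot above.

```r
# Toy data (hypothetical): three artists with different career spans.
artists <- data.frame(
  artist_base  = c("A", "B", "C"),
  num_singles  = c(12, 1, 8),
  earliest_hit = c(1980, 1995, 2004),
  latest_hit   = c(2000, 1995, 2006)
)

# Same definitions as in the summary above.
artists$longevity     <- artists$latest_hit - artists$earliest_hit
artists$hits_per_year <- artists$num_singles / artists$longevity  # Inf for artist B

# Possible alternative (an assumption): count both endpoint years as active,
# so a single-year career has a span of 1 rather than 0.
artists$hits_per_active_year <- artists$num_singles / (artists$longevity + 1)

print(artists)
```

Because more than half of all artists have a single hit, this edge case affects a large share of the rows, and it is worth keeping in mind when reading the scatterplot.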
However, being a “flash in the pan” artist with multiple hit singles over a short span of time might still be preferable to the fates of the majority of charting artists. In the cumulative relative frequency histogram below, we see that over half of all artists are “one hit wonders”, only ever generating a single hit, and over three quarters of artists generate at most 2 hits.
p1 = ggplot(top_artists) + geom_histogram(aes(x=num_singles), binwidth=1)
p2 = ggplot(top_artists) + geom_histogram(aes(x=num_singles, y=cumsum(..count../sum(..count..))), binwidth=1)
grid.arrange(p1, p2, ncol=2)
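The cumulative relative frequencies shown in the right-hand histogram can also be computed directly, which makes the “one hit wonder” share easy to read off. The following is a minimal sketch in base R on hypothetical hit counts, not on our data.

```r
# Toy data (hypothetical): number of hit singles for ten artists.
num_singles <- c(1, 1, 1, 1, 1, 1, 2, 2, 3, 7)

# Frequency of each distinct hit count.
counts <- table(num_singles)

# Share of artists with at most k hits, for each observed k.
cum_share <- cumsum(counts) / sum(counts)
print(cum_share)
```

In this toy example the first entry of `cum_share` is the share of artists with exactly one hit and the second is the share with at most two hits, which correspond to the two figures quoted in the text above.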
Audio features
In this subsection, we turn our focus to the audio features obtained from Spotify for each song, particularly how these features are related to one another as well as how they change over time. Originally, we aggregated each feature by year and plotted how the mean evolves over time. Not satisfied with the resulting loss in information by aggregating using solely the mean, we then incorporated additional aggregations such as the maximum and the minimum for each year before ultimately deciding to create box plots for each year to minimize information loss.
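To illustrate why the mean alone loses information, the sketch below compares a per-year mean with a per-year five-number summary on hypothetical durations. It uses base R's `fivenum()` (Tukey's five-number summary), which can differ slightly from the quartiles that a box plot layer draws, so it should be read as an approximation of what the box plots encode.

```r
# Toy data (hypothetical): song durations in minutes for two years.
songs <- data.frame(
  year         = c(1990, 1990, 1990, 2010, 2010, 2010),
  duration_min = c(3.5, 4.8, 6.0, 3.2, 3.6, 3.9)
)

by_year <- split(songs$duration_min, songs$year)

# Mean only: a single number per year.
mean_by_year <- sapply(by_year, mean)

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum.
five_num <- t(sapply(by_year, fivenum))
colnames(five_num) <- c("min", "lower_hinge", "median", "upper_hinge", "max")

print(mean_by_year)
print(five_num)
```

The five-number summary exposes the much wider spread of the first year, which the two means by themselves do not reveal.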
#Duration
spotifydf <- df%>%
group_by(year)
ggplot(spotifydf) + geom_boxplot(aes(year, duration_min, group=year))+
ggtitle("Duration over time")+
ylab("Duration (in Minutes)")+
scale_x_continuous(breaks = seq(1960,2020,5))
From above, we observe a gradual increase in the durations of top 100 singles that peaks at 1990 with median song lengths just shy of 5 minutes before starting on a downward trend to under 4 minutes. This trend of declining song durations coincides with a decline in the variation of song lengths. One possible explanation is that as the music industry has become increasingly competitive, artists are forced to grab their audience’s attention as quickly as possible, and as listeners gain access to an ever-expanding catalogue of songs, their attention spans are decreasing.
#Acousticness
ggplot(spotifydf) + geom_boxplot(aes(year, acousticness, group=year)) +
ggtitle("Acousticness over time") +
ylab("Acousticness") +
scale_x_continuous(breaks = seq(1960,2020,5))
Top 100 songs have been trending downward in acousticness from 1965 to 2015. This trend might also be attributed to an increasingly competitive music industry; live studio musicians and indeed recording studios themselves are much more expensive in contrast to hardware synthesizers, and later, software synthesizers on which artists can create entire songs using only a laptop computer.
#Danceability
ggplot(spotifydf) + geom_boxplot(aes(year, danceability, group=year)) +
ggtitle("Danceability over time")+
ylab("Danceability ")+
scale_x_continuous(breaks = seq(1960,2020,5))
While the median danceability of songs has remained relatively stable from 1985 to 2015, there exists a noticeable increase from 1965 to 1984.
#Energy
ggplot(spotifydf) + geom_boxplot(aes(year, energy, group=year)) +
ggtitle("Energy over time")+
ylab("Energy ")+
scale_x_continuous(breaks = seq(1960,2020,5))
Overall, it appears that songs are becoming more energetic and busier over time, though the signal is quite noisy. Most interestingly, from 1985 onwards, the change in median energy over time even appears to be cyclical.
#Instrumentalness
ggplot(spotifydf) + geom_boxplot(aes(year, instrumentalness, group=year)) +
ggtitle("Instrumentalness over time")+
ylab("Instrumentalness ")+
scale_x_continuous(breaks = seq(1960,2020,5))
Unsurprisingly, the median instrumentalness of top 100 singles consistently stays at or near 0; however, we can observe a noticeable thinning out of outliers, suggesting that instrumental music is becoming increasingly unpopular. Additionally, this decline in the instrumentalness of songs could be tied to the decline of song lengths; if songs are becoming shorter because of declining listener attention spans, then one is likely to also observe songs with increasingly shorter instrumental introductions and interludes as artists favor “getting to the point.”
#Liveness
ggplot(spotifydf) + geom_boxplot(aes(year, liveness, group=year)) +
ggtitle("Liveness over time")+
ylab("Liveness ")+
scale_x_continuous(breaks = seq(1960,2020,5))
There are no discernible trends in how the distribution of songs’ liveness changes over time.
#Loudness
ggplot(spotifydf) + geom_boxplot(aes(year, loudness, group=year)) +
ggtitle("Loudness over Time")+
ylab("Loudness ")+
scale_x_continuous(breaks = seq(1960,2020,5))
Top 100 singles are unquestionably becoming much louder. We expand on an explanation of the “Loudness Wars” in the executive summary.
#Speechiness
ggplot(spotifydf) + geom_boxplot(aes(year, speechiness, group=year)) +
ggtitle("Speechiness over Time")+
ylab("Speechiness ")+
scale_x_continuous(breaks = seq(1960,2020,5))
Also pronounced is a dramatic increase in speechiness starting in 1990 and peaking in 2004, which may correspond to the increased popularity of rap over that period.
#Tempo
ggplot(spotifydf) + geom_boxplot(aes(year, tempo, group=year)) +
ggtitle("Tempo over Time")+
ylab("Tempo ")+
scale_x_continuous(breaks = seq(1960,2020,5))
There are no discernible trends in how the distribution of songs’ tempos changes over time.
#Valence
ggplot(spotifydf) + geom_boxplot(aes(year, valence, group=year)) +
ggtitle("Valence over Time")+
ylab("Valence")+
scale_x_continuous(breaks = seq(1960,2020,5))
There is a general trend of songs becoming increasingly less positive, perhaps reflecting the increase in cultural and economic pressure society has been experiencing.
#Verbosity
ggplot(spotifydf) + geom_boxplot(aes(year, words_per_sec, group=year)) +
ggtitle("Verbosity over Time")+
ylab("Words per second")+
scale_x_continuous(breaks = seq(1960,2020,5))
Corresponding to the trends involving speechiness over time, we see that the number of words heard per second in songs has increased over time. Indeed, one should expect this given the previously observed speechiness trends as it is much easier to speak quickly than to sing quickly.
keepcols <- c('year', 'acousticness', 'danceability', 'duration', 'energy', 'instrumentalness', 'liveness','loudness','speechiness', 'tempo', 'valence')
spotifydf<-spotifydf%>%
filter(!is.na(duration_ms))%>%
mutate(duration=duration_ms/1000/60)
spotifydf_s <- spotifydf%>%
dplyr::select(all_of(keepcols)) %>%
group_by(year)%>%
summarize(mean_acousticness = mean(acousticness),
mean_danceability = mean(danceability),
mean_duration = mean(duration),
mean_energy = mean(energy),
mean_instrumentalness = mean(instrumentalness),
mean_liveness = mean(liveness),
mean_loudness = mean(loudness),
mean_speechiness = mean(speechiness),
mean_tempo = mean(tempo),
mean_valence = mean(valence)
)
# %>%
# gather(key='variable', value = 'Freq', -year)
spotifydf_s$year<- factor(spotifydf_s$year, levels = unique(spotifydf_s$year))
ggparcoord(spotifydf_s, columns = 2:11, alphaLines = 0.7, groupColumn ='year',scale = 'uniminmax')+xlab("")+ylab("") + theme(axis.text.x = element_text(angle=90))
The parallel coordinates plot above was generated by creating an audio feature profile for each year: we simply average each audio feature across all songs from that year, so each line represents the average Spotify audio features for one year. We considered using a perceptually uniform color space to denote years, but we found that distinguishing the inner decades was more difficult under such a scheme, so we stayed with the default. The features have also been rescaled so that each feature’s maximum value is 1 and its minimum is 0. Some notable observations:
- Songs from 1966 to 1975 are markedly more acoustic and less danceable than those from later years, which also display a trend of decreasing acousticness over time.
- Songs from the 1990s seem to have had the longest duration.
- Songs from the most recent decade appear to be among the highest energy, least instrumental, and loudest.
- Songs from the 1980s appear to be much less speechy than those from other decades.
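The rescaling applied to the parallel coordinates plot (each feature mapped so that its minimum is 0 and its maximum is 1) can be sketched as follows. The helper `rescale01` and the toy yearly means are hypothetical and serve only to show the arithmetic.

```r
# Toy yearly feature means (hypothetical values).
yearly_means <- data.frame(
  year          = c(1970, 1990, 2010),
  mean_loudness = c(-12, -10, -6),
  mean_tempo    = c(118, 120, 122)
)

# Min-max rescaling: (x - min) / (max - min), applied per feature.
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

scaled <- yearly_means
scaled[-1] <- lapply(yearly_means[-1], rescale01)  # leave the year column as is
print(scaled)
```

Because every feature is rescaled independently, the plot shows where each year sits within a feature's own observed range; it does not allow magnitudes to be compared across features.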
To examine the correlation structure of the data’s continuous features, we create a correlation matrix and visualize it with a heatmap below. Additionally, we arrange the features so that those with the most similar correlations to other features are placed near each other, with the dendrograms on the top and left showing which features are most similar to each other.
columns = c('rank', 'year', 'acousticness', 'danceability', 'duration_min',
'energy', 'instrumentalness', 'liveness', 'loudness', 'popularity',
'speechiness', 'tempo', 'valence', 'words_per_sec')
df_cor = df[columns]
correlation = cor(df_cor, method='pearson', use='pairwise.complete.obs')
col = colorRampPalette(c('red', 'white', 'green'))(20)
heatmap(x = correlation, col = col, symm = TRUE)
Note: The heatmap below was originally created in Python using the Seaborn visualization library. In translating the visualization to R using R’s heatmap functionality, we were unable to produce a legend or annotate the grid cells individually, so we present this second, more complete heatmap as an image in addition to the one created in R above.
The dendrogrammed correlation heatmap reveals some interesting aspects of the data’s correlation structure:
- Popularity and year are weakly correlated. This makes intuitive sense given that the audience on Spotify is more likely to listen to more current music and that people who would enjoy the older music might be less likely to listen to it on Spotify.
- Loudness and year are also weakly correlated, as previously observed.
- Loudness and energy are moderately correlated.
- Acousticness and energy are weakly anticorrelated. This could be explained by the fact that ballads and other slow music are more likely to feature acoustic instruments.
- Valence and danceability are weakly correlated. This could suggest that for the most part, people are less inclined to dance to sad or angry music.
- Words per second is weakly correlated with speechiness. Indeed, it is easier to talk fast than to sing fast.
The correlation heatmap provides guidance as to which pairs of features are worth investigating further.
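One way to turn a correlation matrix into a shortlist of pairs worth plotting is to rank the off-diagonal entries by absolute value. The sketch below does this in base R on synthetic data generated purely for illustration; the feature names mirror ours, but the values and the strength of the relationships are made up.

```r
# Synthetic data (not from the dataset): loudness is built to track energy.
set.seed(1)
n <- 200
energy <- runif(n)
features <- data.frame(
  energy   = energy,
  loudness = 10 * energy + rnorm(n),
  tempo    = rnorm(n, mean = 120, sd = 20)
)

correlation <- cor(features, method = "pearson", use = "pairwise.complete.obs")

# Keep each unordered pair once by taking the upper triangle of the matrix.
upper <- which(upper.tri(correlation), arr.ind = TRUE)
feature_pairs <- data.frame(
  feature_1 = rownames(correlation)[upper[, "row"]],
  feature_2 = colnames(correlation)[upper[, "col"]],
  r         = correlation[upper],
  stringsAsFactors = FALSE
)

# Strongest relationships first, regardless of sign.
feature_pairs <- feature_pairs[order(abs(feature_pairs$r), decreasing = TRUE), ]
print(feature_pairs)
```

Sorting by absolute value places strong negative relationships, such as the one between acousticness and energy, alongside strong positive ones.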
p1 = ggplot(df, aes(x=acousticness, y=energy)) +
geom_point(alpha=0.15) +
facet_wrap(~ decade)
p2 = ggplot(df, aes(x=acousticness, y=energy, color=year)) +
geom_point(alpha=0.5) +
ggtitle('Acousticness vs. energy')
grid.arrange(p1, p2, ncol = 2)
It appears that most songs have very low acousticness (below 0.125) and high energy (above 0.5). However, outside this dense region, as a song increases in acousticness, its energy appears to decrease quadratically. Furthermore, it appears that these less common high-acoustic/lower-energy songs are predominantly older, with the vast majority of more recent songs occupying the low-acoustic/higher-energy region. We can observe this more clearly by faceting on decade. The faceted scatterplots confirm that songs are indeed trending towards lower acousticness and higher energy with each passing decade (which was also observed in the time series box plots). It should be noted that because the time range of the dataset is from 1965-2015, the facets for the 1960s and 2010s have half as many data points as the facets for other decades.
p2 = ggplot(df, aes(x=danceability, y=valence, color=year)) +
geom_point(alpha=0.5) +
ggtitle('Danceability vs valence')
p1 = ggplot(df, aes(x=danceability, y=valence)) +
geom_point(alpha=0.15) +
facet_wrap(~ decade)
grid.arrange(p1, p2, ncol = 2)
p2 = ggplot(df, aes(x=speechiness, y=words_per_sec, color=year)) +
geom_point(alpha=0.5) +
ggtitle('Speechiness vs verbosity')
p1 = ggplot(df, aes(x=speechiness, y=words_per_sec)) +
geom_point(alpha=0.15) +
facet_wrap(~ decade)
grid.arrange(p1, p2, ncol = 2)
Above, we observe that speechiness and verbosity (words per second) are indeed positively correlated, albeit moderately, and that top 100 singles have become increasingly speechy and verbose starting in the 1990s, when rap and hip hop first gained mainstream appeal.
p1 = ggplot(df, aes(x=energy, y=loudness, color=year)) +
geom_point(alpha=0.5) +
ggtitle('Energy vs. loudness')
p2 = ggplot(df, aes(x=energy, y=loudness)) +
geom_point(alpha=0.15) +
facet_wrap(~ decade)
grid.arrange(p2, p1, ncol = 2)
Words
In this subsection, we perform analyses on the words within the lyrics of each song. In particular, given the rich set of features provided by Spotify, we can explore interesting questions such as “what are the most danceable words?” To facilitate such analysis, we use Python to tidy our dataset at the word-level. Whereas the original song-level dataset contains one row per song, the word-level dataset contains one row per unique word per song. For example, if the word “dance” appears multiple times in a single song, it would be transformed into a single row in the new dataset, but if the word “dance” appears in X songs, it would be transformed into X rows in the dataset. Each row of the word-level dataset inherits all the features of the song it belongs to (year, title, artist, audio features) and contains a new feature “count” indicating the number of times the word appears in that song. Furthermore, only words that appear in at least 10 different songs are included in the new dataset. Otherwise, if we attempt to answer “what are the most danceable words” by averaging the danceability of songs that word appears in, words that appear only in the single most danceable song would dominate such a ranking.
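The reshaping from a song-level to a word-level dataset described above can be sketched as follows. Our actual tidying was done in Python; this base R version, its toy lyrics, its simple tokenizer, and the lowered threshold `min_songs` are all assumptions chosen so the example stays small.

```r
# Toy song-level data (hypothetical): one row per song.
songs <- data.frame(
  song_id      = 1:3,
  danceability = c(0.9, 0.4, 0.7),
  lyrics       = c("dance dance all night", "all alone tonight", "dance with me"),
  stringsAsFactors = FALSE
)

# One row per unique word per song; each row inherits the song's features.
rows <- lapply(seq_len(nrow(songs)), function(i) {
  words <- strsplit(tolower(songs$lyrics[i]), "[^a-z']+")[[1]]
  words <- words[words != ""]
  counts <- table(words)
  data.frame(
    song_id      = songs$song_id[i],
    danceability = songs$danceability[i],
    word         = names(counts),
    count        = as.integer(counts),
    stringsAsFactors = FALSE
  )
})
word_df <- do.call(rbind, rows)

# Keep words appearing in at least min_songs different songs.
# The report uses 10; 2 is used here only because the toy data is tiny.
min_songs <- 2
songs_per_word <- table(word_df$word)
keep <- names(songs_per_word)[songs_per_word >= min_songs]
word_df <- word_df[word_df$word %in% keep, ]
print(word_df)
```

Because each word contributes one row per song, counting rows per word gives the number of songs it appears in, which is exactly what the minimum-song filter needs.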
We performed an n-gram analysis of the tidy dataset. Using the scikit-learn package (Python), we collected the 100 most frequent unigrams, bigrams, and trigrams per decade (1960s to 2010s). Preprocessing can be found here. Word clouds appeared to be the most effective way to visualize them. We will analyze them in the executive summary. Let’s start with all our n-grams (unigram, bigram, trigram) by decade:
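For readers unfamiliar with n-grams, the sketch below shows how unigrams, bigrams, and trigrams are formed and counted. The extraction itself was done with scikit-learn in Python; this base R helper `ngrams` and the made-up lyric line are hypothetical and only illustrate the idea.

```r
# Build all runs of n consecutive tokens, joined by a space.
ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "), character(1))
}

# Made-up lyric line (hypothetical).
lyrics <- "la la la oh oh yeah"
tokens <- strsplit(tolower(lyrics), "\\s+")[[1]]

# Count each n-gram and sort by frequency, most frequent first.
unigram_counts <- sort(table(ngrams(tokens, 1)), decreasing = TRUE)
bigram_counts  <- sort(table(ngrams(tokens, 2)), decreasing = TRUE)
trigram_counts <- sort(table(ngrams(tokens, 3)), decreasing = TRUE)

print(bigram_counts)
```

A sequence of k tokens yields k - n + 1 n-grams, so individual bigrams and trigrams tend to have lower counts than individual words, which is consistent with unigrams dominating the combined word clouds.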
n_gram <- read.csv("../data/Louis/n_gram.csv")
pal <- brewer.pal(9, "OrRd")
pal <- pal[-(1:3)]
for (i in 0:5){
n_gram_i <- filter(n_gram, n_gram$decade == 1960+10*i)
wordcloud(n_gram_i$word, n_gram_i$count, min.freq =3, scale=c(5, .2), random.order = FALSE, random.color = FALSE, colors= pal)}
Unigrams are dominant in those clouds, which was expected. Let’s focus on them:
for (i in 0:5){
n_gram_i <- filter(n_gram, n_gram$decade == 1960+10*i, n_gram$gram == 'unigram')
wordcloud(n_gram_i$word, n_gram_i$count, min.freq =3, scale=c(5, .2), random.order = FALSE, random.color = FALSE, colors= pal)}
Let’s have a look at bigrams only:
for (i in 0:5){
n_gram_i <- filter(n_gram, n_gram$decade == 1960+10*i, n_gram$gram == 'bigram')
wordcloud(n_gram_i$word, n_gram_i$count, min.freq =3, scale=c(5, .2), random.order = FALSE, random.color = FALSE, colors= pal)}
Finally, trigrams:
for (i in 0:5){
n_gram_i <- filter(n_gram, n_gram$decade == 1960+10*i, n_gram$gram == 'trigram')
wordcloud(n_gram_i$word, n_gram_i$count, min.freq =3, scale=c(5, .2), random.order = FALSE, random.color = FALSE, colors= pal)}
Another aspect of word analysis is the evolution of a word’s count on the Billboard chart over the years. Our interest is twofold: first, to observe whether there is any correlation between word counts and societal events; second, to determine whether these trends could help decipher how people’s attitudes have evolved over time. Let’s look at 4 pairs of words: (‘love’, ‘like’), (‘girl’, ‘boy’), (‘woman’, ‘man’), (‘war’, ‘peace’). Additional analysis will be performed in the executive summary. All preprocessing can be found at: here.
word_set <- list(list('love', 'like'), list('girl', 'boy'), list('woman', 'man'), list('war', 'peace'))
word_count <- read.csv("../data/Louis/word_count_per_year.csv")
for (i in 1:4){
word1 <- word_set[[i]][[1]]
word2 <- word_set[[i]][[2]]
word_filter = filter(word_count, word_count$word == word1 | word_count$word == word2)
print(ggplot(word_filter, aes(x = year, y = count, col = word)) +
geom_line() +
theme(legend.position="bottom") +
ggtitle(paste(word1,word2, sep="/")) +
theme(plot.title = element_text(hjust = 0.5)) +
xlab("Year") +
ylab("Count"))
}
Working on a word-level dataset allowed us to perform audio feature analysis by word. We focused on the energy, danceability, and explicitness of words. Results are presented in word clouds; some of them might be surprising.
word_df <- read.csv("../data/tidy-words.csv")
most_energetic_words = word_df %>%
group_by(word) %>%
summarize(energy_med = median(energy)) %>%
filter(!is.na(energy_med)) %>%
top_n(100, energy_med) %>%
mutate(energy_scale = (energy_med - min(energy_med))*100/max(energy_med))
least_energetic_words = word_df %>%
group_by(word) %>%
summarize(energy_med = median(energy)) %>%
filter(!is.na(energy_med)) %>%
top_n(100, desc(energy_med)) %>%
mutate(energy_scale = 100*(1 - (energy_med - min(energy_med))/max(energy_med)))
wordcloud(most_energetic_words$word, most_energetic_words$energy_scale, scale=c(2, .2), random.order = FALSE, random.color = FALSE, colors= pal)
wordcloud(least_energetic_words$word, least_energetic_words$energy_scale, scale=c(1, .1), random.order = FALSE, random.color = FALSE, colors= pal)
most_danceable_words = word_df %>%
group_by(word) %>%
summarize(danceability_med = median(danceability)) %>%
filter(!is.na(danceability_med)) %>%
top_n(100, danceability_med) %>%
mutate(danceability_scale = (danceability_med - min(danceability_med))*100/max(danceability_med))
least_danceable_words = word_df %>%
group_by(word) %>%
summarize(danceability_med = median(danceability)) %>%
filter(!is.na(danceability_med)) %>%
top_n(100, desc(danceability_med)) %>%
mutate(danceability_scale = 100*(1 - (danceability_med - min(danceability_med))/max(danceability_med)))
wordcloud(most_danceable_words$word, most_danceable_words$danceability_scale, scale=c(2, .2), random.order = FALSE, random.color = FALSE, colors= pal)
wordcloud(least_danceable_words$word, least_danceable_words$danceability_scale, scale=c(1, .1), random.order = FALSE, random.color = FALSE, colors= pal)
most_explicit_words = word_df %>%
group_by(word) %>%
summarize(explicit_mean = mean(explicit)) %>%
filter(!is.na(explicit_mean)) %>%
top_n(100, explicit_mean) %>%
mutate(explicit_scale = (explicit_mean - min(explicit_mean))*100/max(explicit_mean))
wordcloud(most_explicit_words$word, most_explicit_words$explicit_scale, scale=c(2, .2), random.order = FALSE, random.color = FALSE, colors= pal)